Predicting the Injuriousness of Traffic Collisions Occurring in the City of Chicago: A Final Report

Final Project
Data Science 2 with R (STAT 301-2)

Author

Donny Tou

Published

March 6, 2024

Github Repo Link

Introduction

In this project, I will investigate the following predictive question: will any of the unfortunate individuals involved in a traffic collision — both motorists and non-motorists alike — emerge injured? This is a classification problem because it asks whether a traffic incident will be injurious, not how many people are injured.

I want to investigate this predictive problem in particular because of the ubiquity of driving: more than 80 percent of U.S. adults are licensed drivers, with over 90 percent using motor vehicles to transport themselves to and from work (Hedges & Company, 2018; Dews, 2013). Driving is relatively cheap and time-efficient, providing individuals with a large amount of geographic freedom and autonomy. Yet, there are significant costs to driving: more cars on the road often correspond to a higher frequency of traffic collisions, many of which can result in debilitating injuries and even death. In fact, the U.S. experiences more motor-vehicle fatalities in both absolute and per-capita terms than any other high-income country (Yellman & Sauber-Schatz, 2022). In attempting to predict the “price tag” of driving (in the form of injurious collisions), I hope to minimize the heavy costs associated with an activity that has become so prevalent in, and important to, daily life.

To build my predictive modelling process, I will be analyzing collision-level crash data covering traffic incidents occurring within City of Chicago limits and under the jurisdiction of the Chicago Police Department since 2015¹. Approximately half of the observations are self-reported at the police district by the agent(s) involved; the other half are recorded by the responding police officer.

Data Overview

In its raw form, my dataset on Chicago traffic collisions describes over 800,000 crashes with 50 columns which — after basic cleaning (cleaning column names, factorizing relevant variables, collapsing factor levels) — consist of 27 factor variables, 17 integer variables, 5 string variables, and 1 logical variable. A dataset-wide missingness analysis reveals the following results.

Table 1: Total missingness for each of the 50 variables, summarized
Figure 1: Total missingness for each of the 50 variables, visualized

Table 1 reinforces Figure 1 by showing that 11/50 variables in the data see a missingness rate of over 65%, with the top 8 seeing missing values for over 90% of the entire dataset — a magnitude of missingness that is certainly concerning, especially since it limits my flexibility for feature engineering/selection. Fortunately, the rate of missingness drops precipitously for the remaining 39 variables, which see either zero or close-to-zero missingness.

To construct my outcome variable of interest, I derive it with if_else() from the preexisting variable injuries_total, an integer count of the individuals sustaining fatal, incapacitating, non-incapacitating, or possible injuries in a given traffic collision. The resulting binary outcome variable — injurious — is appended to the dataset as a new column and is coded as follows (a brief code sketch appears after the list):

Yes if injuries_total exceeds zero;
No if injuries_total equals zero; and
NA if injuries_total is a missing value
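A minimal sketch of this outcome construction, assuming the cleaned crash data lives in a tibble named collisions (an illustrative name); column names follow the report.

```r
library(dplyr)

# Derive the binary outcome: "Yes" if anyone was injured, "No" otherwise;
# if_else() propagates NA whenever injuries_total is missing.
collisions <- collisions |>
  mutate(
    injurious = if_else(injuries_total > 0, "Yes", "No"),
    injurious = factor(injurious, levels = c("Yes", "No"))
  )
```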

I have chosen to take the “classification route” with the new injurious variable — as opposed to the “regression route” with the existing injuries_total variable — so as to preserve the injury-focused nature of my initial question while also streamlining the ensuing data downsizing/balancing process².

  ² Described in the following section

    The following exhibits will explore the missingness and distribution of my injurious outcome variable using the original data.

    Table 2: Extent of missingness for outcome variable injurious
    variable n_miss pct_miss
    injurious 1757 0.2187653

    The outcome variable injurious fortunately sees a missingness rate of only 0.22% (Table 2), which means that the process of “throwing out” missing values during recipe building should not generate significant disruptions/bias. Next, I will explore the distribution of injurious across the raw dataset after excluding the 1,757 (out of 803,144) observations that lack data on injuriousness.

    Figure 2: Distribution of injurious, visualized
    Table 3: Distribution of injurious, summarized
    injurious n
    Yes 110241
    No 691146

    Figure 2 and Table 3 reveal a large amount of class imbalance in injurious: non-injurious collisions significantly outnumber injurious collisions, with the ratio between the two classes exceeding 6:1. This warrants 2 additional steps — dataset downsampling and stratified random sampling — which will be described in further detail in the next section.

    Specifically, this next section will explore in detail my steps of data splitting, model building/tuning, and recipe engineering.

    Methods

    Dataset Downsampling/Downsizing

    Prior to spending my dataset, I first run it through 2 prerequisite steps: downsampling and downsizing. The sheer size of my dataset in its raw form (with 800,000+ rows) is unsustainable given my computational and temporal constraints; additionally, the class imbalance existing within my binomial outcome variable, injurious, warrants adjustments. I address both of these concerns in tandem through these 2 steps:

    1. Downsampling: After throwing out missing data (~0.22% of rows), I use slice_sample() to randomly downsample my dataset with respect to the underrepresented class in injurious, “Yes”. This reduces the number of observations from over 800,000 to about 220,000³ while at the same time ensuring a 1:1 class balance in my binary outcome variable.
    2. Additional downsizing: A 220,000-row dataset still exceeds my computational limits; as such, I further downsize my sample via random selection using initial_split(). The result is a ~44,000-row final dataset, exactly 20% as large as the post-downsampling dataset and roughly 5-6% as large as the original dataset (a code sketch of both steps follows the footnote below).
  ³ The number of observations in the underrepresented outcome variable class — collisions that are injurious — is about 110,000 in the raw dataset
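A minimal sketch of these two steps, assuming the cleaned data sits in a tibble named collisions; the object names and the seed are illustrative, not taken from the report.

```r
library(dplyr)
library(rsample)

set.seed(301)

# Step 1 (downsampling): drop rows with a missing outcome, then sample the
# majority class ("No") down to the size of the minority class ("Yes").
n_yes <- collisions |> filter(injurious == "Yes") |> nrow()

collisions_balanced <- collisions |>
  filter(!is.na(injurious)) |>
  group_by(injurious) |>
  slice_sample(n = n_yes) |>
  ungroup()

# Step 2 (downsizing): keep a random 20% of the balanced data as the final
# ~44,000-row working dataset, via initial_split() as described above.
collisions_final <- collisions_balanced |>
  initial_split(prop = 0.20, strata = injurious) |>
  training()
```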

    Data Splitting/Resampling

    Next, I use an 80:20 proportion combined with stratified random sampling (with respect to injurious) in initial_split() in order to split this final dataset into training and testing sets. Within the training set of roughly 35,000 rows, I then use V-fold cross-validation to generate resampled data on which I later conduct my model competition process. Specifically, my resampling process entails randomly partitioning my training set into 5 subsets (v = 5) repeated 3 times (repeats = 3), generating 15 resamples/folds — each of which contains roughly 7,000 observations⁴ — on which my various models are trained and evaluated (a code sketch of this step follows the footnote below).

  ⁴ Since v = 5, each resample uses 80% of the training set (4 of the 5 folds, ~28,000 rows) for model fitting and holds out the remaining 20% (one ~7,000-row fold) for assessment
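A brief sketch of the split-and-resample step described above, again with illustrative object names:

```r
library(rsample)

set.seed(301)

# 80:20 train/test split, stratified on the outcome.
collision_split <- initial_split(collisions_final, prop = 0.80, strata = injurious)
collision_train <- training(collision_split)
collision_test  <- testing(collision_split)

# Repeated V-fold cross-validation: 5 folds x 3 repeats = 15 resamples.
collision_folds <- vfold_cv(collision_train, v = 5, repeats = 3, strata = injurious)
```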

    Table 4: Data set observation counts post-splitting
    Set       n
    Training  35276
    Testing   8820
    Total     44096

    Table 4 reveals that the 80:20-proportion split has been successfully implemented.

    Model Building/Tuning

    I define and, using a regular grid, tune the following 7 models⁵ — the first 2 being baseline models — for use in my model competition (a code sketch of one such specification follows the list):

  ⁵ To accommodate the binomial nature of injurious, all 7 models use mode = “classification”

    1. Null: A simple baseline null model defined using null_model() with the parsnip engine.
    2. Naive Bayes: A simple “step-up” baseline model defined using naive_Bayes() with the klaR engine.
    3. Logistic regression: A parametric non-regularized regression model defined using logistic_reg() with the glm engine.
    4. Elastic net: A parametric regularized regression model defined using logistic_reg() with the glmnet engine and the following tuning parameters:
      1. Mixture explored over [0, 1] with 10 levels
      2. Penalty explored over [-3, 0] with 10 levels
    5. K-nearest neighbors: A non-parametric algorithm defined using nearest_neighbor() with the kknn engine and the following tuning parameter:
      1. Neighbors explored over [1, 15] with 5 levels
    6. Random forest: A non-parametric, independently-trained algorithm defined using rand_forest() with the ranger engine, 500 trees, and the following tuning parameters:
      1. Number of predictors randomly sampled at each split explored over [1, 5] with 4 levels
      2. Minimum number of node data points required for further splitting explored over [2, 40] with 4 levels
    7. Boosted tree: A non-parametric, sequentially-trained algorithm defined using boost_tree() with the xgboost engine, 500 trees, and the following tuning parameters:
      1. Number of predictors randomly sampled at each split explored over [1, 5] with 4 levels
      2. Minimum number of node data points required for further splitting explored over [2, 40] with 4 levels
      3. Learning rate explored over [-5, -0.2] on log-10 scale with 4 levels
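As an illustration, here is a sketch of how one of these specifications — the boosted tree — and its regular tuning grid might be defined with tidymodels; the object names are mine, and the ranges mirror the list above.

```r
library(tidymodels)

# Boosted tree specification: 500 trees, three tuned hyperparameters.
btree_spec <- boost_tree(
  trees      = 500,
  mtry       = tune(),
  min_n      = tune(),
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Regular grid: 4 levels per parameter = 64 candidate combinations.
btree_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 40)),
  learn_rate(range = c(-5, -0.2)),  # range supplied on the log-10 scale
  levels = 4
)
```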

    Recipe Engineering

    Next, the recipes I use in conjunction with my 7 models are constructed along 2 independent dimensions:

    1. “Kitchen sink” vs. “feature engineered”: These differ on the basis of feature selection; my kitchen sink recipe uses as many predictors as possible, while my feature engineered recipe is more selective with the predictors used
      1. The kitchen sink feature selection only filters out 26 “unacceptable” predictors⁶, which are variables that:
        1. See missingness rates in excess of 90% (e.g. workers_present_i);
        2. Are too closely correlated with injurious (e.g. injuries_total);
        3. Contain too many factor levels, often because they serve as identifiers (e.g. crash_record_id)
      2. The feature engineered selection, on the other hand, actively includes 11 predictors⁷ — alignment, posted_speed_limit, lane_cnt, intersection_related_i, trafficway_type, device_condition, report_type, first_crash_type, num_units, lighting_condition, and month — which I select on the basis of having observable/notable bivariate relationships with injurious⁸
    2. Parametric vs. non-parametric: Recipes that are compatible with my 3 parametric models (null, logistic regression, and elastic net) differ from recipes meant for my 3 non-parametric models (nearest neighbors, random forest, and boosted tree) in 2 ways…
      1. Unlike their parametric counterparts, my non-parametric recipes use one-hot encoding when converting factor variables into numeric terms
      2. Unlike their non-parametric counterparts, my parametric recipes incorporate interaction terms between 5 predictors⁹: lighting_condition, num_units, trafficway_type, intersection_related_i, and alignment
  ⁶ View Appendix: Technical Info to see which 26 variables are filtered out

  ⁷ Refer to README.md within the data/ subdirectory for variable definitions

  ⁸ Refer to Appendix: EDA for univariate/bivariate/multivariate analyses of predictors in relation to my predictive problem

  ⁹ This comparison only applies to the feature engineered parametric/non-parametric recipes; the kitchen sink recipe is intentionally kept simple in its omission of interaction terms

    These 2 dimensions alone generate 4 possible recipe combinations: parametric + kitchen sink, parametric + feature engineered, non-parametric + kitchen sink, and non-parametric + feature engineered. Despite their differences, all 4 recipes share the same basic pre-processing steps of a) imputing missing predictor values using nearest neighbors; b) dummy-encoding all factor predictors; c) removing predictor variables with zero variance; and d) centering/scaling all numeric predictors. A sketch of one such recipe follows.
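For example, a sketch of the non-parametric (“tree”) kitchen sink recipe under these shared steps, assuming the 26 filtered-out variables are stored in a character vector named unacceptable_vars (an illustrative name) and reusing collision_train from the earlier sketch:

```r
library(recipes)

recipe_ks_tree <- recipe(injurious ~ ., data = collision_train) |>
  step_rm(all_of(unacceptable_vars)) |>       # drop the 26 unacceptable predictors
  step_impute_knn(all_predictors()) |>        # a) impute missing values via nearest neighbors
  step_dummy(all_nominal_predictors(),
             one_hot = TRUE) |>               # b) dummy-encode factors (one-hot for tree recipes)
  step_zv(all_predictors()) |>                # c) remove zero-variance predictors
  step_normalize(all_numeric_predictors())    # d) center and scale numeric predictors
```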

    Importantly, I include an additional recipe designed exclusively for the naive Bayes model; this recipe is identical to the parametric + kitchen sink recipe, except it omits the pre-processing step of dummy-encoding factor variables. I therefore end up with 5 total recipes: the 4 combinations detailed above and the naive-Bayes-specific specification.

    Assessment Metric

    So that I can systematically compare the predictive performances of my various models and their recipe specifications, I will use the accuracy assessment metric — which measures the proportion of observations guessed correctly by a given model — for its easy interpretability as well as its compatibility with the binomial nature of my injurious outcome variable.

    Model Building & Selection Results

    Model Candidates

    In the ensuing model competition process, my 2 baseline models (null and naive Bayes) are individually matched with 1 recipe¹⁰, while the other 5 more-complex models are each matched with 2 recipes. The recipe-by-recipe breakdown is as follows, with a sketch of one possible way to organize these pairings after the list:

  ¹⁰ The naive Bayes model uses its designated recipe specification while the null model uses the parametric + kitchen sink recipe specification

    1. Naive Bayes recipe only: matched only to the naive Bayes model;
    2. Parametric + kitchen sink recipe only: matched only to the null model;
    3. Parametric + kitchen sink recipe AND parametric + feature engineered recipe: matched to the logistic regression and elastic net models;
    4. Non-parametric + kitchen sink recipe AND non-parametric + feature engineered recipe: matched to the nearest neighbors, random forest, and boosted tree models
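The report does not specify the exact mechanics of these pairings; one possible way to organize the non-parametric pairings (item 4 above) is with the workflowsets package, sketched below with illustrative object names (recipe_fe_tree, knn_spec, and rf_spec are assumed to be defined analogously to the earlier sketches).

```r
library(tidymodels)
library(workflowsets)

complex_wflows <- workflow_set(
  preproc = list(
    kitchen_sink_tree = recipe_ks_tree,   # non-parametric kitchen sink recipe
    feat_eng_tree     = recipe_fe_tree    # non-parametric feature engineered recipe
  ),
  models = list(
    knn     = knn_spec,
    rf      = rf_spec,
    boosted = btree_spec
  ),
  cross = TRUE
) |>
  # Attach the regular grid to the boosted tree workflows (similarly for the others).
  option_add(grid = btree_grid, id = "kitchen_sink_tree_boosted") |>
  option_add(grid = btree_grid, id = "feat_eng_tree_boosted") |>
  # Tune every workflow on the 15 folds, collecting accuracy.
  workflow_map(
    "tune_grid",
    resamples = collision_folds,
    metrics   = metric_set(accuracy)
  )
```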

    Fitted Results: General Takeaways

    The following exhibit displays the accuracy results of the best-performing candidates for each model/recipe combination¹¹, fitted and then averaged across the 15 resamples/folds:

  ¹¹ For context, complex models that are fit using kitchen sink recipes are denoted with “1”, and complex models that are fit using feature engineered recipes are denoted with “2”

    Table 5: Model competition results using the accuracy metric

    Below is the same table, but rearranged such that the best-performing models are at the top:

    Table 6: Model competition results using the accuracy metric, arranged in descending order
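Assuming the tuned workflow set from the earlier sketch (complex_wflows), a ranking like Table 6 could be produced along these lines:

```r
library(workflowsets)
library(dplyr)

# Rank all tuned workflows by their best mean accuracy across the 15 folds.
rank_results(complex_wflows, rank_metric = "accuracy", select_best = TRUE) |>
  filter(.metric == "accuracy") |>
  select(wflow_id, mean, std_err, rank)
```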

    There are 3 general takeaways from Table 5 and Table 6:

    1. The top-performing individual model per the accuracy metric is the boosted tree model fit using the kitchen sink recipe
    2. Across all complex model types, the kitchen sink recipe strictly dominates the feature engineered recipe when it comes to predictive accuracy
      1. This suggests that, in my case, feature engineering is not worth it: it adds additional effort and yields worse results
      2. This does not necessarily suggest, however, that a kitchen sink strategy for feature selection is in general strictly superior to a more selective one
        1. In my case, it is likely that I may have simply omitted key variables that I should have included in the construction of my feature engineered recipe
    3. Among the top-performing complex models, there appear to be close-to-zero differences in predictive performance
      1. For instance, the difference in mean accuracy between my top performer (boosted1) and the “runner-up” (rf1) is only 0.0003496, which is merely a 0.0457% difference in performance
      2. Nonetheless, the 5 non-baseline models do perform with greater predictive accuracy than the baseline null and nbayes models by a non-trivial margin

    The Top-Performing Model Candidate

    Table 6 reveals that, holding the recipe constant, the differences in accuracy between the boosted tree, random forest, logistic regression, and elastic net models are so small that it becomes difficult to definitively settle on a “best” model candidate. For the purpose of this project, however, I will choose boosted1 — the boosted tree model fit using the kitchen sink recipe — as my final model candidate; I do this for 2 overarching reasons:

    1. Even though the random forest model outperforms the boosted tree model on the feature engineered recipe (compare rf2 to boosted2), it has already been established that the kitchen sink recipe is strictly dominant regardless of model type
      1. To this point, the best-performing candidate on the kitchen sink recipe is the boosted tree model
    2. Since the difference in predictive performance between the top candidates is near-zero, the “best” option for me is still to default to the top-performing candidate — regardless of the gap in performance — unless I have a compelling external reason to not do so
      1. In this case, the fact that the “performance gaps” between candidates remain consistently small across both recipes tells me that there appears to be no compelling reason to actively avoid the boosted tree model

    More specifically, the top-performing model (boosted1) on average correctly predicts the injury status of approximately 76.60% of collisions across the 15 resamples/folds. Note, however, that this winning candidate represents only 1 of the 64 boosted tree models created via tuning combinations during the model-building process. The following exhibits therefore explore in further detail the particular tuning parameters of the winning boosted1 model:

    Figure 3: A visual inspection of tuning parameter performances for the boosted tree model class

    Figure 3 reveals a few notable findings regarding boosted tree tuning:

    1. Top-performing boosted tree models per the accuracy metric appear to cluster around a particular point in the bottom-right graph (learn_rate = -1.8, log scale)
      1. At this point, mtry equals its maximum of 5 while min_n equals its minimum of 2
    2. A higher mtry value over a range of [1, 5] generally corresponds to greater model performance but this cannot be extrapolated across all cases
      1. Notably, this trend reverses at the highest possible learn_rate over [-5, -0.2] of -0.2 (in log scale)
    3. A lower min_n value over a range of [2, 40] generally corresponds to greater model performance, but this cannot be extrapolated across all cases
      1. In parallel to the finding above, this trend reverses at the highest possible learn_rate of -0.2 (in log scale)

    The end result from Figure 3 — the single best-performing boosted tree model — can be summarized in the following table:

    Table 7: Exact hyperparameter values for the top-performing model, boosted1
    mtry min_n learn_rate .config
    5 2 0.0158489 Preprocessor1_Model36

    The following statement can be drawn from Table 7: conditional on a learn_rate of -1.8 (log-scaled), my boosted tree model performs with the greatest predictive accuracy using the smallest possible min_n (2) and the largest possible mtry (5). The most puzzling conclusion from this tuning analysis is that there is no clear-cut answer regarding the optimal learn_rate; it does not seem to follow a predictable “ceteris paribus” pattern with respect to accuracy, which is particularly concerning because the value of learn_rate also appears to determine the optimal hyperparameter values for min_n and mtry. As such, a case can be made for further tuning of the learn_rate parameter; a more optimal value may exist and could potentially be uncovered via, for example, an exploration conducted across more levels.

    Additional Tuning Analysis: Other Candidates

    The following exhibit explores the optimal tuning parameter values for the best-performing representatives of the 3 other model types marked for tuning: random forest, K-nearest neighbors, and elastic net.

    Table 8: Optimal hyperparameter values for the best-performing random forest, K-nearest neighbors, and elastic net candidates

    Table 8 reveals that the optimal tuning hyperparameters for each model are generally consistent across the kitchen sink and feature engineered recipes; the only exception to this is the elastic net model, whose optimal hyperparameter value for penalty slightly differs between the two. As for the remaining 2 models:

    1. K-nearest neighbors model: Its predictive performance is optimized with the largest possible number of neighbors (15)
    2. Random forest model: Its predictive performance is optimized with the largest mtry value (5) and a moderate min_n value (27); a sketch of how such values can be extracted follows
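For reference, each candidate's best hyperparameters could be extracted from the tuned workflow set along these lines (reusing the illustrative complex_wflows object and id naming from the earlier sketch):

```r
library(workflowsets)
library(tune)

# Best hyperparameters for, e.g., the random forest fit with the kitchen sink recipe.
complex_wflows |>
  extract_workflow_set_result("kitchen_sink_tree_rf") |>
  select_best(metric = "accuracy")
```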

    Building More Complex Models: Is It Worth It?

    Perhaps one of the most important queries to investigate in the context of model competition/selection is whether it is truly worth it to build more complex models which, in my case, are the logistic regression, elastic net, K-nearest neighbors, random forest, and boosted tree models. Given my results from Table 6, I draw the following 2 conclusions:

    1. At a fundamental level, it appears to be certainly worth it to build more complex models beyond the baselines
      1. Relative to the null baseline model, the top 7 competitors see accuracy rates at least 25 percentage points (or 50%) higher
      2. Relative to the naive Bayes baseline model, the same top 7 competitors see accuracy rates around 10-11 percentage points (or 15-17%) higher
    2. There is, however, a catch: conditional on having more-complex models, it does not appear to be worth it to introduce even more complexity
      1. For instance, boosted2 performs only slightly better than en2 using the same underlying recipe

    Final Model Analysis

    The final component of my predictive modelling process entails applying my best-performing model workflow — boosted1 — to my testing set of roughly 9,000 observations. Prior to this, I take 2 prerequisite steps:

    1. I extract the underlying feature engineering specifications (i.e. the non-parametric kitchen sink recipe) as well as the optimal tuning hyperparameters (i.e. mtry = 5, min_n = 2, and learn_rate = -1.8 log-scaled, or ~0.0158) of my winning model, boosted1
    2. I then train the extracted model workflow on my whole ~35,000-row training dataset and save the result as a fitted model object

    Finally, I apply the resulting fitted model to my testing set of ~9,000 rows — all in order to see how this winning model performs on never-before-seen data. A brief sketch of this refit-and-predict workflow appears below, followed by the exhibits displaying my results.
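A sketch of the refit-and-predict workflow, reusing the illustrative objects from the earlier sketches (complex_wflows, collision_train, collision_test):

```r
library(tidymodels)
library(workflowsets)

# Step 1: pull the winning workflow and its best hyperparameters, then finalize.
best_btree_params <- complex_wflows |>
  extract_workflow_set_result("kitchen_sink_tree_boosted") |>
  select_best(metric = "accuracy")

final_wflow <- complex_wflows |>
  extract_workflow("kitchen_sink_tree_boosted") |>
  finalize_workflow(best_btree_params)

# Step 2: train the finalized workflow on the full training set.
final_fit <- fit(final_wflow, data = collision_train)

# Apply to the held-out testing set: class probabilities, hard classes, and truth.
boosted_preds <- predict(final_fit, collision_test, type = "prob") |>
  bind_cols(predict(final_fit, collision_test)) |>
  bind_cols(collision_test |> select(injurious))

# Test-set accuracy (Table 10).
accuracy(boosted_preds, truth = injurious, estimate = .pred_class)
```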

    Table 9: Tibble of predicted class probabilities and actual/true outcomes

    Table 9 provides a side-by-side comparison between the actual and predicted injurious values of the 8,820 traffic collisions within my testing set, as well as the class probabilities assigned to each of the 2 levels — Yes and No — per collision.

    Table 10: Performance summary of boosted1 on testing data using accuracy
    .metric .estimator .estimate
    accuracy binary 0.764966

    Using the accuracy metric, Table 10 reveals that the winning boosted tree model correctly predicts the injury status of a given collision 76.497% of the time within the testing set.

    Now, let’s further decompose this predictive assessment using a confusion matrix.
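Assuming the prediction tibble boosted_preds from the previous sketch, the matrix itself can be computed with yardstick:

```r
library(yardstick)
library(ggplot2)

# Cross-tabulate predicted vs. actual injurious classes (Figure 4).
conf_mat(boosted_preds, truth = injurious, estimate = .pred_class) |>
  autoplot(type = "heatmap")
```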

    Figure 4: A 2x2 confusion matrix comparing predicted .vs. actual injurious class values.

    Figure 4 confirms our finding from Table 10: (3361 + 3386) / (3361 + 3386 + 1024 + 1049) = 6747 / 8820 = 76.497% of testing-set traffic collisions are correctly predicted. Notably:

    1. 3,386 predictions are true positives — injurious collisions that are correctly predicted to be injurious by the model
    2. 3,361 predictions are true negatives — non-injurious collisions that are correctly predicted to be non-injurious by the model
    3. 1,049 predictions are false positives — collisions that are predicted to be injurious but are actually non-injurious
    4. 1,024 predictions are false negatives — collisions that are predicted to be non-injurious but are actually injurious

    Overall, a final accuracy metric of 76.497% on the assessment data is a slight decline from 76.60% — the final model’s accuracy on the resampled data — a decrease of 0.103 percentage points. This is to be expected, since boosted1, which was trained across the resampled training folds, is now assessed on “never-before-seen” data.

    Overall, I believe that this is a solid performance: even on never-before-seen data, this final model’s accuracy metric is a) over 25 percentage points (50%) more accurate than the null model and b) over 11 percentage points (17%) more accurate than the naive Bayes model, both of which are assessed across the 15 resamples. While this is not a magnificent performance, I believe it does justify building more complex models beyond just the baselines.

    Still, this model is far from perfect; analyses from previous reports identify 3 key caveats to the efficacy of this final model:

    1. The underlying kitchen sink recipe matched with this model is not necessarily optimal
      1. Its superior performance against my feature engineered recipe may point towards a flaw in my feature engineering — not necessarily the dominance of a kitchen sink selection strategy
    2. The “objectively optimal” hyperparameter values for min_n, mtry, and learn_rate are still not exactly known
      1. This analysis merely identified the best-performing tuning set among 64 possible boosted-tree-model combinations, but more combinations exist and should be explored if computational constraints allow for it
    3. Although boosted1 is the top performer among candidates in Table 6, the differences across the upper 50% of candidates are marginal at best
      1. If this model is “any good”, then it’s possible that the runner-up models are as well

    Conclusion

    In summary, this report details my journey of predicting traffic collision injury outcomes in Chicago. Utilizing a robust dataset and strategic downsampling/downsizing, the boosted tree model, fused with the “kitchen sink” recipe, emerges as the top performer with an average accuracy rate of 76.60% across resamples. The analysis delves into the nuanced relationships among hyperparameters, emphasizing the importance of tuning, especially for the learn_rate parameter. The final model, when evaluated on a separate testing set, maintains a solid accuracy rate of 76.497%, surpassing the null and naive Bayes baseline models. This report concludes by advocating for continuous refinement and exploration to enhance predictive modelling robustness — a practice that is especially important when it comes to devising targeted traffic interventions that can potentially save thousands of lives.

    References

    Hedges & Company. (2018). How many licensed drivers are there in the US? https://hedgescompany.com/blog/2018/10/number-of-licensed-drivers-usa/#:~:text=Across%20all%20age%20groups%2C%2084.1,population%20has%20a%20driver’s%20license

    Yellman, M. A. & Sauber-Schatz, E. K. (2022). Motor Vehicle Crash Deaths — United States and 28 Other High-Income Countries, 2015 and 2019. Morbidity and Mortality Weekly Report (MMWR), 71(26), 837-843. https://www.cdc.gov/mmwr/volumes/71/wr/mm7126a1.htm?s_cid=mm7126a1_w#suggestedcitation

    Appendix: EDA

    In this section, I detail the bivariate and multivariate EDA exhibits used to justify my predictor and interaction term selections for use in my feature engineered recipe. I also detail the univariate explorations conducted on factor-based features to explore potential class imbalances.

    The following EDAs are NOT conducted on the entire 800,000-row raw dataset; doing so would involve the ~9,000 rows I use to assess my final model. Instead, I use throwaway data: the roughly 80% * 220,000 = 176,000 rows that I “throw out” during the downsizing stage (i.e. the portion of the downsampled dataset not retained in the final ~44,000-row dataset). By doing this, I ensure no contamination of my assessment data during the exploratory stages of this project.

    Recall that the feature selection/engineering steps unique to recipe1_parametric and recipe1_tree consist of the following components:

    1. 11 predictors/“features”¹²: alignment, posted_speed_limit, lane_cnt, intersection_related_i, trafficway_type, device_condition, report_type, first_crash_type, num_units, lighting_condition, and month
    2. 4 interaction terms created using 5 features — lighting_condition, num_units, trafficway_type, intersection_related_i, and alignment — defined as follows (a recipe sketch appears after the list):
      1. An interaction term between num_units and lighting_condition
      2. An interaction term between num_units and trafficway_type
      3. An interaction term between alignment and lighting_condition
      4. An interaction term between alignment and intersection_related_i
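A sketch of how these interaction terms might be added to the parametric feature engineered recipe (recipe_fe_parametric is an illustrative name); the step must come after step_dummy(), hence the starts_with() selectors on the dummy-encoded factor columns.

```r
library(recipes)

recipe_fe_parametric <- recipe_fe_parametric |>
  step_interact(
    ~ num_units:starts_with("lighting_condition") +
      num_units:starts_with("trafficway_type") +
      starts_with("alignment"):starts_with("lighting_condition") +
      starts_with("alignment"):starts_with("intersection_related_i")
  )
```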
  ¹² Refer to README.md within the data/ subdirectory for variable definitions

    The following 3 subsections are dedicated to bivariate, multivariate, and univariate EDA exhibits, respectively.

    Bivariate EDA: Feature Selection

    1. Street Alignment

    alignment: “Street alignment at crash location, as determined by reporting officer.”

    Figure 5: Bivariate Analysis — Alignment and Injuriousness (1)
    Figure 6: Bivariate Analysis — Alignment and Injuriousness (2)

    A higher proportion of collisions occurring on streets with curved alignments are injurious.
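For illustration, an exhibit of this kind could be produced along these lines, assuming the throwaway EDA data is stored in a tibble named eda_data (an illustrative name):

```r
library(ggplot2)

# Proportion of injurious vs. non-injurious collisions within each alignment level.
eda_data |>
  ggplot(aes(x = alignment, fill = injurious)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(
    x = "Street alignment",
    y = "Proportion of collisions",
    fill = "Injurious?"
  )
```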

    2. Posted Speed Limit

    posted_speed_limit: “Posted speed limit, as determined by reporting officer.”

    Figure 7: Bivariate Analysis — Speed Limit and Injuriousness
    Table 11: Bivariate Analysis — Speed Limit and Injuriousness
    injurious Mean speed limit
    Yes 29.65187
    No 28.23470

    On average, injurious collisions occur on roadways with slightly-higher posted speed limits.

    3. Lane Count

    lane_cnt: “Total number of through lanes in either direction, excluding turn lanes, as determined by reporting officer (0 = intersection).”

    Table 12: Bivariate Analysis — Lane Count and Injuriousness
    injurious Mean lane count
    Yes 2.746064
    No 2.585954

    On average, injurious collisions occur on roadways with slightly more lanes.

    4. Intersection-Relatedness

    intersection_related_i: “A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.”

    Figure 8: Bivariate Analysis — Intersection Relatedness and Injuriousness (1)
    Figure 9: Bivariate Analysis — Intersection Relatedness and Injuriousness (2)

    A higher proportion of collisions reported to be intersection-related are injurious.

    5. Trafficway Type

    trafficway_type: “Trafficway type, as determined by reporting officer.”

    Figure 10: Bivariate Analysis — Trafficway Type and Injury Status

    Collisions occurring on four-way trafficways tend to be the most injurious; collisions occurring in parking lots tend to be the least injurious.

    6. First Crash Type

    first_crash_type: “Type of first collision in crash.”

    Figure 11: Bivariate Analysis — First Crash Type and Injuriousness

    Injurious collisions are more likely to involve non-motorists.

    7. Collision Report Type

    report_type: “Administrative report type (at scene, at desk, amended).”

    Figure 12: Bivariate Analysis — Administrative Report Type and Injuriousness

    Injurious collisions are more likely to be reported on-scene; non-injurious collisions are more likely to be reported at-desk.

    8. Number of Units Involved

    num_units: “Number of units involved in the crash. A unit can be a motor vehicle, a pedestrian, a bicyclist, or another non-passenger roadway user. Each unit represents a mode of traffic with an independent trajectory.”

    Figure 13: Bivariate Analysis — Collision Scope and Injuriousness
    Table 13: Bivariate Analysis — Collision Scope and Injuriousness
    injurious Mean number of units involved
    Yes 2.130747
    No 2.022371

    On average, injurious collisions tend to involve slightly more units.

    9. Traffic Device Condition

    device_condition: “Condition of traffic control device, as determined by reporting officer.”

    Figure 14: Bivariate Analysis — Traffic Device Condition and Injuriousness

    Strangely, injurious collisions are less likely to occur near poorly functioning traffic control devices. I would expect the opposite to be true.

    10. Month of Collision

    month: “The month component of [the collision’s date of occurrence].”

    Figure 15: Bivariate Analysis — Month Of Collision and Injuriousness
    Table 14: Bivariate Analysis — Month Of Collision and Injuriousness
    injurious Average collision month
    Yes 6.874865
    No 6.692368

    Injurious collisions tend to occur slightly later in a given year.

    11. Lighting Conditions

    lighting_condition: “Light condition at time of crash, as determined by reporting officer.”

    Figure 16: Bivariate Analysis — Lighting Condition and Injuriousness
    Figure 17: Bivariate Analysis — Lighting Condition and Injuriousness, factor levels collapsed

    Relative to their non-injurious counterparts, injurious collisions are slightly more likely to occur under non-daylight lighting conditions.

    Multivariate EDA: Interaction Term Selection

    Here, I will display the exploratory analyses conducted to discover and justify my 4 interaction terms.

    1) num_units * lighting_condition

    Does the effect of collision scope on injuriousness potentially depend on external lighting conditions?

    Figure 18: Injuriousness .vs. Collision Scope by Lighting Condition (Factor Levels Condensed)

    The positive association between incident scope (num_units) and injury status (injurious) appears to be much stronger for collisions occurring under daylight.

    2) num_units * trafficway_type

    Does the effect of collision scope on injuriousness potentially depend on external trafficway conditions?

    Figure 19: Injuriousness .vs. Collision Scope by Trafficway Type (Factor Levels Condensed)

    A distinct linear relationship between collision scope (num_units) and injury status (injurious) applies only to collisions for which trafficway_type = “FOUR WAY”; for the other trafficway categories, the relationship is “U-shaped” and/or unclear.

    3) alignment * lighting_condition

    Does the effect of street alignment on injuriousness potentially depend on external lighting conditions?

    Figure 20: Injuriousness .vs. Street Alignment by Lighting Condition, Factor Levels Condensed

    The extent to which curved streets (alignment = “CURVE”) contribute to higher injuriousness (injurious = “Yes”) depends slightly on external lighting conditions.

    Univariate EDA: Predictor Class Imbalances

    8 of my 11 predictors used in the feature engineered recipes are categorical/factorial/nominal. As such, I dedicate the following univariate exhibits to exploring the class imbalances among them.

    1. Street Alignment

    Figure 22: Class Imbalance — Street Alignment
    Table 15: Street Alignment, Counted
    alignment n
    CURVE ON GRADE 321
    CURVE ON HILLCREST 105
    CURVE, LEVEL 1468
    STRAIGHT AND LEVEL 171535
    STRAIGHT ON GRADE 2403
    STRAIGHT ON HILLCREST 554

    2. Intersection Relatedness

    Figure 23: Class Imbalance — Intersection Relatedness
    Table 16: Intersection Relatedness, Counted
    intersection_related_i n
    N 2234
    Y 52797
    NA 121355

    3. Trafficway Type

    Figure 24: Class Imbalance — Trafficway Type
    Table 17: Trafficway Type, Counted
    trafficway_type n
    DIVIDED 41973
    FOUR WAY 16156
    NOT DIVIDED 75269
    ONE-WAY 18365
    PARKING LOT 8350
    Other 16273

    4. First Crash Type

    Figure 25: Class Imbalance — First Crash Type
    Table 18: First Crash Type, Counted
    first_crash_type n
    Motorist 144301
    Animal 106
    Object 10825
    Other 542
    Non-motorist 20612

    5. Collision Report Type

    Figure 26: Class Imbalance — Administrative Report Type
    Table 19: Administrative Report Type, Counted
    report_type n
    AMENDED 49
    NOT ON SCENE (DESK REPORT) 70116
    ON SCENE 99048
    NA 7173

    6. Traffic Device Condition

    Figure 27: Class Imbalance — Traffic Control Device Condition
    Table 20: Traffic Control Device Condition, Counted
    device_condition n
    Bad 92468
    Good 71516
    Unknown 12402

    7. Lighting Condition

    Figure 28: Class Imbalance — Lighting Condition
    Table 21: Lighting Condition, Counted
    lighting_condition n
    DARKNESS 8121
    DARKNESS, LIGHTED ROAD 43181
    DAWN 3098
    DAYLIGHT 111214
    DUSK 5255
    UNKNOWN 5517

    8. Month of Collision

    Figure 29: Class Imbalance — Month of Collision
    Table 22: Collision Month, Counted
    month n
    1 14000
    2 11776
    3 12528
    4 12431
    5 14760
    6 15256
    7 15697
    8 16093
    9 16334
    10 17069
    11 15284
    12 15158

    Appendix: Technical Info